Dollars for Votes - the US presidential elections

by martin-martin

=====================================================================================

I’m currently spending time in the USA, and it’s (pre-)election time! My membership at a 24h gym exposes me to a massive amount of politics on TV, whether I’m doing my time on the crosstrainer or just changing in the dressing room. It’s February and the elections are in November, but there is already a near-constant broadcast of candidates battling each other on CNN.

I’m impressed and surprised by how omnipresent the elections are on TV this early on. And I’m concerned about the rudeness that these debates often display. Politics in Europe are different. They’re not golden, nope, and there are insults and dirty tricks over there as well. But here in the US there’s a level of “attack”, “aggression” and “god” in the talks that I am thoroughly not used to. :/

Once, while in the dressing room, I got stuck watching the TV talk about Dollars per Vote, and found this comparison very interesting. Some candidates spend very little money in absolute terms compared to others, and yet some of them still end up spending more money per vote received.

Now, my analysis will not deal with this comparison, since the ballots will not be cast for a few more months. Instead, I will make a preliminary analysis of the money that the presidential candidates received from individual contributors in the run-up to the election.


Pt. 0: The Data

The data for the 2016 Presidential Campaign Contributions can be found here: http://fec.gov/disclosurep/PDownload.do Since I am currently in California, I downloaded the dataset for that state, CA.zip.

Here’s also a peek at the structure of the data that the .csv contains. It’s an excerpt from this file: ftp://ftp.fec.gov/FEC/Presidential_Map/2016/DATA_DICTIONARIES/CONTRIBUTOR_FORMAT.txt I have also indicated which of the columns I ended up dropping.

The text file is comma delimited and uses double-quotation marks as the text qualifier.

------------------------------------------------------------------------ 

CMTE_ID                 COMMITTEE ID                                      S <- dropped
CAND_ID                 CANDIDATE ID                                      S
CAND_NM                 CANDIDATE NAME                                    S
CONTBR_NM               CONTRIBUTOR NAME                                  S
CONTBR_CITY             CONTRIBUTOR CITY                                  S
CONTBR_ST               CONTRIBUTOR STATE                                 S <- dropped
CONTBR_ZIP              CONTRIBUTOR ZIP CODE                              S
CONTBR_EMPLOYER         CONTRIBUTOR EMPLOYER                              S
CONTBR_OCCUPATION       CONTRIBUTOR OCCUPATION                            S
CONTB_RECEIPT_AMT       CONTRIBUTION RECEIPT AMOUNT                       N
CONTB_RECEIPT_DT        CONTRIBUTION RECEIPT DATE                         D 
RECEIPT_DESC            RECEIPT DESCRIPTION                               S <- dropped
MEMO_CD                 MEMO CODE                                         S <- dropped
MEMO_TEXT               MEMO TEXT                                         S <- dropped
FORM_TP                 FORM TYPE                                         S <- dropped
FILE_NUM                FILE NUMBER                                       N <- dropped
TRAN_ID                 TRANSACTION ID                                    S <- dropped
ELECTION_TP             ELECTION TYPE/PRIMARY GENERAL INDICATOR           S <- dropped


Data Type:  S = string (alpha or alpha-numeric); D = date; N = numeric  

------------------------------------------------------------------------ 

One thing that becomes very obvious here, and that strongly influenced my exploration, is that there is only one column containing a meaningful numerical variable: contb_receipt_amt. (file_num is also marked as numeric, but it is only an identifier.)

This is the column that holds the amounts of the contributions in $, and it is therefore also one of the most interesting columns. A lot of comparisons were anchored to it.

However, there is another “hidden” numerical value associated with the dataset: the number of rows. Counting rows in combination with the categorical values of the other columns allows many more graphs to be created.


Ideas and Interests that I will explore

  • Who got the most contributions? Which person, which party, which gender?

  • Which city contributed the most, and to whom (also in relation to the number of inhabitants)?

  • How many “NOT EMPLOYED” people contributed to the campaigns compared to people with employment? Also: whom did the “NOT EMPLOYED” support the most?


Pt. 1: Mini-Wrangling

In order to be able to load the data, I had to add a comma at the end of the header row in my .csv file.
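The same fix can also be applied in code instead of editing the file by hand. This is a minimal sketch on a toy stand-in for the file; it assumes the problem is that every data row ends with a trailing comma and therefore has one more field than the header row, in which case read.csv() would otherwise treat the first column as row names. The sample rows are made up for illustration.

```r
# Toy stand-in for the .csv file (made-up rows): each data row ends with a
# trailing comma, so it has one more field than the header row.
raw_lines <- c(
  "cand_nm,contbr_city,contb_receipt_amt",
  "\"Sanders, Bernard\",\"OAKLAND\",27,",
  "\"Clinton, Hillary Rodham\",\"SAN DIEGO\",100,"
)

# Append a comma to the header row so that the field counts match again
raw_lines[1] <- paste0(raw_lines[1], ",")

df <- read.csv(text = raw_lines, stringsAsFactors = FALSE)

# The trailing commas create one extra column that is entirely NA; drop it
df <- df[, colSums(!is.na(df)) > 0, drop = FALSE]
```

For the real file, raw_lines would come from readLines() on the downloaded .csv.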


Adding new columns: cand_gender and cand_party

For some of my questions, I want to look at the data by gender and by party affiliation of the candidates. Since these columns are not present in the dataset, I gathered the required information online and added it as two new columns to my data.frame df.

First I’ll map candidate names to candidate IDs, so that I can refer to a cand_id when passing a function, which is easier than writing out the full name.

Now I can add the new columns to the data.frame. There is one member of the Green Party, who gets the value ‘G’; the Republicans get the value ‘R’ and the Democrats the value ‘D’.
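One way to add such columns is a lookup with named vectors. The sketch below is keyed by candidate name rather than by cand_id, covers only five of the candidates, and uses a toy df; it illustrates the idea and is not necessarily the mapping used for this report.

```r
# Toy data.frame with a handful of candidates (names as they appear in cand_nm)
df <- data.frame(
  cand_nm = c("Sanders, Bernard", "Clinton, Hillary Rodham",
              "Stein, Jill", "Trump, Donald J.", "Fiorina, Carly"),
  stringsAsFactors = FALSE
)

# Named vectors act as lookup tables keyed by candidate name
gender_lookup <- c(
  "Sanders, Bernard"        = "M",
  "Clinton, Hillary Rodham" = "F",
  "Stein, Jill"             = "F",
  "Trump, Donald J."        = "M",
  "Fiorina, Carly"          = "F"
)
party_lookup <- c(
  "Sanders, Bernard"        = "D",
  "Clinton, Hillary Rodham" = "D",
  "Stein, Jill"             = "G",
  "Trump, Donald J."        = "R",
  "Fiorina, Carly"          = "R"
)

# Indexing a named vector by name returns the matching value for every row
df$cand_gender <- unname(gender_lookup[df$cand_nm])
df$cand_party  <- unname(party_lookup[df$cand_nm])
```

A candidate that is missing from a lookup vector would get NA, which makes gaps in the mapping easy to spot.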


Removing obsolete columns

There is quite a lot of data that I do not intend to use for my exploration. This includes e.g. contbr_st, which in my case is always CA, and file_num, a unique number used to link the transactions to the reports.
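Dropping the columns marked as dropped in the data dictionary can be sketched as follows. The toy df only contains a few columns, and the lower-case column names are an assumption about how the data was loaded.

```r
# Toy data.frame with two columns to keep and two columns to drop
df <- data.frame(
  cand_nm           = c("Sanders, Bernard", "Clinton, Hillary Rodham"),
  contbr_st         = c("CA", "CA"),
  file_num          = c(1L, 2L),
  contb_receipt_amt = c(50, 100),
  stringsAsFactors  = FALSE
)

# Columns marked "<- dropped" in the data dictionary
drop_cols <- c("cmte_id", "contbr_st", "receipt_desc", "memo_cd", "memo_text",
               "form_tp", "file_num", "tran_id", "election_tp")

# Keep every column whose name is not in the drop list
df <- df[, !(names(df) %in% drop_cols), drop = FALSE]
```

Selecting by name with %in% does not fail if one of the listed columns is absent, which is convenient while the set of columns is still changing.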


Pt. 2: Exploration

Having cleaned and adapted the dataset to my wishes, I can start exploring.

For example, I can query how many individual contributions each candidate received, e.g. Hillary Clinton:

## [1] 42063
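A count like this can be obtained by summing a logical comparison on cand_nm. The sketch below uses a toy df with made-up rows, so the resulting number is not the one from the real dataset.

```r
# Toy data.frame: one row per individual contribution
df <- data.frame(
  cand_nm = c("Clinton, Hillary Rodham", "Sanders, Bernard",
              "Clinton, Hillary Rodham", "Stein, Jill"),
  contb_receipt_amt = c(100, 27, 250, 50),
  stringsAsFactors = FALSE
)

# TRUE counts as 1, so the sum is the number of matching rows
n_clinton <- sum(df$cand_nm == "Clinton, Hillary Rodham")
```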

QUESTION 1: Who got the most money through contributions?

In this section I’ll take a look at money and amounts. I’m wondering which candidate got the most contributions, and whether it is the same person who got the most money through contributions. How is the size of the contributions distributed across the candidates? Which gender received more contributions, or more money? Which political party?


PART 1: the mean and distributions

Constructing a frequency plot shows me something about Sanders, Bernard, who has a big dot low down. Some candidates have a narrow distribution and some a wider one. There is another person with a big dot at the bottom, yet also a quite big one higher up (I think this is Clinton, Hillary), and one more person with a distribution that stretches far up and down, but also with a strong base, it seems (Cruz, Ted?).

Cleaned up the graph a bit and reordered ascending! This plot gives a nice overview of the mean contributions per candidate! Maybe it’s a keeper ; )


PART 2: amounts

Most of the contributions are below 1000 $!

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


PART 2.1: refunds

The previous graph gives me negative values for amounts, which is at first very confusing. So I construct a vector holding only the refunds.

## [1] 1498

Seems there are nearly 1500 “contributors” who actually got more money refunded than what they gave.

And some gave some, but also got some back (or got some back multiple times?):

## [1] 2102
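The two counts above can be approximated along the following lines. This is a sketch under assumptions: negative values of contb_receipt_amt are treated as refunds, contributors are identified by contbr_nm alone, and the names and amounts are made up.

```r
# Toy data.frame with made-up contributors; negative amounts are refunds
df <- data.frame(
  contbr_nm = c("DOE, JANE", "DOE, JANE", "ROE, RICHARD",
                "ROE, RICHARD", "POE, ALEX"),
  contb_receipt_amt = c(100, -250, 500, -200, 50),
  stringsAsFactors = FALSE
)

# All refund rows
refunds <- df[df$contb_receipt_amt < 0, ]

# Net amount per contributor; a negative net means more refunded than given
net <- tapply(df$contb_receipt_amt, df$contbr_nm, sum)
net_negative <- names(net)[net < 0]

# Contributors that appear with both a positive and a negative amount
gave_and_got_back <- intersect(
  df$contbr_nm[df$contb_receipt_amt > 0],
  df$contbr_nm[df$contb_receipt_amt < 0]
)
```

Identifying people by name alone merges different contributors who share a name, so counts obtained this way are only approximate.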

So I tried a plot for these refunds (which was too big, so I’ll only take a look at the largest amounts).

This is a lot of money to get back. I wonder why and how. But I must admit I don’t really understand US politics and how the candidates are funded. Here’s some info: https://ballotpedia.org/California_Proposition_34,_Limits_on_Campaign_Contributions_(2000)


PART 2.2: contributions

Well, so let’s see who gave the most:

## [1] 24

There are 24 people who all gave the same amount, so I suspect that there is an upper limit around 10.000$ (maybe without taxes, or such? They should already be PACs?)

I also spotted potentially erroneous data: the contributor DE GROOTE, DOUG MR. is listed as well as a DE GROOTE, DOUG, both with the same amount, which makes me believe that it is a mistake in the data (or a rather badly executed way of increasing one’s contribution limit…)

Maybe I can check :)

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
## [23]  TRUE  TRUE

Yep, seems that some of those transactions are listed more often in the dataset. Actually quite a lot of them! 7 TRUE!
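A check of this kind can be done with duplicated(), which flags every row that is identical to an earlier row. The sketch below uses made-up contributors and dates.

```r
# Toy subset of top contributions (names, amounts and dates are made up)
top <- data.frame(
  contbr_nm = c("DOE, JANE", "DOE, JANE", "ROE, RICHARD", "ROE, RICHARD MR."),
  contb_receipt_amt = c(10800, 10800, 10800, 10800),
  contb_receipt_dt  = c("01-FEB-16", "01-FEB-16", "03-FEB-16", "03-FEB-16"),
  stringsAsFactors = FALSE
)

# TRUE marks a row that exactly repeats an earlier row
dup <- duplicated(top)

# Number of repeated rows
n_dup <- sum(dup)
```

Note that exact matching does not catch spelling variants of the same name, such as a trailing "MR."; those rows count as different.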

It seems that DE GROOTE could be also a company, because it’s listed also among the contbr_employer column. Well. But I’m not gonna go hunt down these individuals :)


PART 3: gender and money

Number of contributions per gender of candidate

Total amount of money through contributions per gender of candidate

Whoa! It seems that female candidates received nearly as much money in contributions as male candidates did, even though there are only 3 female vs. 19 male candidates, and there were far fewer contributions to female candidates than to male candidates!
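The numbers behind these two plots are a row count and a sum per group. A minimal sketch with base R, on a toy df with made-up amounts:

```r
# Toy data.frame: gender of the receiving candidate and contribution amount
df <- data.frame(
  cand_gender = c("F", "M", "M", "M", "F"),
  contb_receipt_amt = c(2700, 27, 50, 100, 500),
  stringsAsFactors = FALSE
)

# Number of contributions per gender of candidate
n_by_gender <- table(df$cand_gender)

# Total and mean amount per gender of candidate
total_by_gender <- tapply(df$contb_receipt_amt, df$cand_gender, sum)
mean_by_gender  <- tapply(df$contb_receipt_amt, df$cand_gender, mean)
```

In this toy example the female candidates receive fewer contributions but more money, which is the same kind of pattern as in the plots above.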


PART 3.1: binning

It could be interesting to take a look at the mean contribution per F/M candidate.

So here are the summaries for the contributions for the female candidates, then for the male candidates, and finally for the whole dataset combined:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -5400.0    25.0   100.0   501.7   250.0  5400.0
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -10000.0     25.0     50.0    179.2    100.0  10800.0
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -10000.0     25.0     50.0    262.4    100.0  10800.0

I’ll start with a generous binwidth of 1000:

This graphic is pretty useless: nearly every contribution falls within the first bin, and there are very few above 1000. So I can adjust the binning:

Whoa! Most contributions are actually between 0 and 100 $!! So there are many, many small contributions that were made in CA. Let’s look at this data in a table:

## 
##         (0,100]       (100,200]       (200,300]       (300,400] 
##          135435            7100           12498             493 
##       (400,500]       (500,600]       (600,700]       (700,800] 
##            5781             140             214             144 
##       (800,900]     (900,1e+03]   (1e+03,2e+03]   (2e+03,3e+03] 
##              47            4777            1448            9823 
## (3e+03,1.1e+04] 
##             445
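A table with this shape can be produced with cut() and table(). The break points below are read off the bin labels above; the amounts are made up.

```r
# Made-up contribution amounts in $
amounts <- c(5, 27, 100, 250, 1000, 2700, 5400)

# Bin edges: steps of 100 up to 1000, then wider bins up to 11000
breaks <- c(seq(0, 1000, by = 100), 2000, 3000, 11000)

# Assign every amount to a half-open interval (lower, upper]
bins <- cut(amounts, breaks = breaks)

# Count the contributions per bin
counts <- table(bins)
```

Amounts outside the breaks, such as the negative refund values, end up as NA and are left out of the table.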

Interesting, because I already saw that most low contributions went to one candidate: Sanders, Bernard.


PART 3.2: proportions

Next I’ll display the proportions of female/male candidates in the respective parties.

The proportion of female and male candidates running in the presidential elections is very different between the three parties.

Let’s see a graph plotting the percentage of money per candidate per gender.

But of course that’s VERY misleading!… Clinton, Hillary got so much of the money contributed to female candidates that we should make this more clear:

As we can see, Hillary Clinton received by far the largest amount of money through contributions. She nearly single-handedly takes half of all the contributions made.


PART 4: party money

I can also see that Clinton and Sanders make up the two biggest single sections, and they are both with the Democrats. Therefore it would be interesting to plot money per party instead of by gender.

Here we can see that the Democrats received way more money through contributions than the Republicans did, and that Hillary Clinton alone received more money than all the Republican candidates combined. The contributions to the Green Party, in contrast, are so tiny that they become invisible in this visualization.


PART 5: candidate money

Number of contributions per candidate:

## 
##                 Bush, Jeb       Carson, Benjamin S. 
##                      2762                     21045 
##  Christie, Christopher J.   Clinton, Hillary Rodham 
##                       316                     42063 
## Cruz, Rafael Edward 'Ted'            Fiorina, Carly 
##                     21645                      4426 
##        Graham, Lindsey O.            Huckabee, Mike 
##                       331                       447 
##             Jindal, Bobby           Kasich, John R. 
##                        31                       701 
##          Lessig, Lawrence   O'Malley, Martin Joseph 
##                       372                       383 
##         Pataki, George E.                Paul, Rand 
##                        20                      4117 
##    Perry, James R. (Rick)              Rubio, Marco 
##                       116                      7994 
##          Sanders, Bernard      Santorum, Richard J. 
##                     72179                        79 
##               Stein, Jill          Trump, Donald J. 
##                        85                       590 
##             Walker, Scott     Webb, James Henry Jr. 
##                       670                       106
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

While most are vanishingly small in comparison, four candidates stick out in terms of number of contributions.

Let’s look at the money amassed through the contributions:

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

These plots are not very interesting and don’t tell me too much new, so enough of this topic for now.


QUESTION 2: Which city contributed the most, and to whom?

The next one is actually quite interesting, because it shows how people often give round amounts of money (the distinct vertical lines).

Here I’ll apply geom_jitter() only to the actual contributions, leaving out the refunds.

It nicely shows how much most people donate (first big bar to the right of 0), which cities had the most contributions (horizontal black lines), and where there are common discrete jumps, maybe related to regulations such as donation limits (vertical lines). This is a nice graph :)

Trying to remove cities and keep only the ones with high contribution amounts:

Hm… this is not so interesting, so maybe I should rather right away put it in relation with the number of inhabitants.


Getting Population estimates for CA

I found some here: https://www.census.gov/popest/data/cities/totals/2012/SUB-EST2012.html The data is for 2012, but it’s the most current one that is listing cities that I could find there. It’s not perfect, but for now I just want to take a look :)

It could be interesting to display this with a facet_wrap(). However, there are too many cities. So I should maybe try this instead:

  1. calculating the ratios for contributions/inhabitant
  2. subsetting into categories of lower quartile, mean, upper quartile
  3. showing one plot for maybe the mean of each group

Calculating the Ratios

Oops, there is also data for “counties” in here, which now have the same name as some of the cities… so for every duplicated name I’ll have to remove the row with the higher population value.
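The steps above (removing the county rows, calculating the ratios, grouping by quartile) could be sketched like this. City names, populations and amounts are made up, the NAME column is an assumption about the census file, and POPESTIMATE2012 is the population column used in this report.

```r
# Made-up contributions per city
contrib <- data.frame(
  contbr_city = c("CITY A", "CITY A", "CITY B", "CITY C", "CITY C", "CITY D"),
  contb_receipt_amt = c(100, 50, 200, 60, 60, 1600),
  stringsAsFactors = FALSE
)

# Made-up population table; "CITY A" appears twice (city and county)
pop <- data.frame(
  NAME = c("CITY A", "CITY A", "CITY B", "CITY C", "CITY D"),
  POPESTIMATE2012 = c(1000, 50000, 2000, 4000, 8000),
  stringsAsFactors = FALSE
)

# Keep only the lowest population per name, which removes the county rows
pop <- aggregate(POPESTIMATE2012 ~ NAME, data = pop, FUN = min)

# Total contributed amount per city
city_totals <- aggregate(contb_receipt_amt ~ contbr_city, data = contrib, FUN = sum)

# Join totals and population, then calculate dollars per inhabitant
cities <- merge(city_totals, pop, by.x = "contbr_city", by.y = "NAME")
cities$ratio <- cities$contb_receipt_amt / cities$POPESTIMATE2012

# Group the cities into lower quartile, middle and upper quartile
cities$group <- cut(
  cities$ratio,
  breaks = quantile(cities$ratio, probs = c(0, 0.25, 0.75, 1)),
  labels = c("lower quartile", "middle", "upper quartile"),
  include.lowest = TRUE
)
```

Keeping the minimum population per name relies on a county always being larger than the city of the same name, which is an assumption worth checking on the real data.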


QUESTION 3: How much and to whom did NOT EMPLOYED give?

I am very perplexed by and involved with the topic of homelessness in the US, so anything that goes in that direction rings a bell with me. Here I’m trying to investigate a little which political direction homeless people might tend towards.

However, I understand that this is highly hypothetical, because I only have data on the monetary contributions that “NOT EMPLOYED” people gave to the presidential campaigns. Of course giving money != political orientation. It might be a good proxy, but it is not an exhaustive factor: many people have a political orientation without having contributed money to a campaign. This might be especially true for homeless people, who are very likely to have very little money at hand. Further, NOT EMPLOYED != homeless. There are quite a few people in the US who are employed but homeless. Assuming that they gave a contribution, they would fall into a different category.

These graphs are not going to show much about how much money was flowing; rather, they are intended as a proxy for where a certain section of society leans politically.

Who gave how much?

Okay, this is not so exciting :) Wait…

## 
##     employed not employed 
##       159985        20446

There are very few NOT EMPLOYED compared to those with employment, so I’ll need ratios

Displaying proportions of which party did people with/without employment give contributions to:

## Warning: Removed 47 rows containing non-finite values (stat_sum).

Interesting: it seems that a much higher percentage of people without employment contribute to the Democrats.

NOTE: I boosted the max_size variable, in order to make it more clear how vanishingly small is the percentage of NOT EMPLOYED people that contributed to the Republican party.
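The proportions behind such a plot can be calculated with table() and prop.table(). This sketch assumes that contb_emp_status is derived by checking contbr_occupation for the literal value "NOT EMPLOYED"; the rows are made up.

```r
# Toy data.frame with made-up occupations and the party of the recipient
df <- data.frame(
  contbr_occupation = c("NOT EMPLOYED", "NOT EMPLOYED", "NOT EMPLOYED",
                        "ENGINEER", "TEACHER", "NURSE", "ENGINEER"),
  cand_party = c("D", "D", "R", "D", "R", "R", "D"),
  stringsAsFactors = FALSE
)

# Reduce the occupation to a two-level employment status
df$contb_emp_status <- ifelse(df$contbr_occupation == "NOT EMPLOYED",
                              "not employed", "employed")

# Contributions per status and party
counts <- table(df$contb_emp_status, df$cand_party)

# Row-wise proportions: each status sums to 1 across the parties
props <- prop.table(counts, margin = 1)
```

Using margin = 1 normalizes within each employment status, so the two groups can be compared even though one is much smaller than the other.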


So, let’s see for whom

## Warning: Removed 47 rows containing non-finite values (stat_sum).

Sanders, Bernard gets percentage-wise the most contributions from the NOT EMPLOYED! And it seems that apart from Trump, Donald J. there is no Republican currently left in the race who got contributions from NOT EMPLOYED people.

When increasing the max_size also here it becomes even more clear how great the difference between Sanders, Bernard and the other candidates is in this aspect.

NOTE: NOT EMPLOYED probably also includes college students! Which makes sense because Sanders wants to take away college debt, so he has a big bunch of the students on his side (I heard of 83% somewhere).

Let’s look at this in another way:

This sounds like a pie chart, haha, also with the implications of who gets the biggest piece :)


Thinking about occupation-voter distribution, now I’m wondering which party IT people lean towards. :)

So I collected all the unique jobs present in the data.frame.

## [1] 8958

Haha, found one row listing contbr_occupation : GRANDPA !! :)

But well, these are too many, and it seems that people just put what they wanted. The values haven’t been checked and grouped, it seems. I don’t want to get into this. So let’s rather wrap it up :)


Discussion of the Analysis

Structure of the dataset

The dataset consists mainly of categorical variables, such as cand_nm, contbr_occupation or contbr_zip. All these are interesting to put in context with the one continuous variable contb_receipt_amt, and of course the number of rows.

A more thorough overview of the data can be taken from the official site: ftp://ftp.fec.gov/FEC/Presidential_Map/2016/DATA_DICTIONARIES/CONTRIBUTOR_FORMAT.txt

Main feature(s) of interest in the dataset

How are the number of individual contributions and the amount of money gained through them spread out among the different candidates? As a further abstraction, it is interesting to look at the same question after aggregating or subsetting the dataset (e.g. by gender of the candidate, or by origin city of the contributor).

Other features that help to support the investigation into the features of interest

I found it interesting to add the population count of the cities in CA, to be able to calculate how many contributions and how much money in total were given per city in relation to the number of inhabitants.

New variables created (indirectly) from existing variables

I created four new columns containing some new variables:

  • cand_gender, defining the gender of the presidential candidate ('F' for “female” and 'M' for “male”)
  • cand_party, defining the party affiliation of the candidate ('D' for “Democrats”, 'R' for “Republicans”, and 'G' for “Greens”)
  • POPESTIMATE2012, containing the population estimates of the CA cities for 2012, taken from a different governmental source online
  • contb_emp_status, where I reduced the contbr_occupation to show only whether a contributor was employed or not

Investigated features: unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?


Relationships observed in the bivariate part of the investigation. Variations of the investigated features in relation with other features of the dataset.

Interesting relationships between other features (not the main feature(s) of interest)

Strongest relationship discovered

In my opinion the strongest relationship that I kept discovering over different plots was the number of contributions received, which was very clearly “won” by Sanders, Bernard.


Relationships observed in the multivariate part of the investigation. Features that strengthened each other in terms of looking at your feature(s) of interest

Interesting or surprising interactions between features

OPTIONAL: Models created with the dataset. Strengths and limitations of the model.


Final Plots and Summary

Plot One

Description One

Plot Two

Description Two

Plot Three

Description Three


Reflection

References:

R:

ggplot:

US politics: